Combining Labeled and Unlabeled Data for Text Classification with a Large Number of Categories

نویسنده

  • Rayid Ghani
چکیده

We develop a framework to incorporate unlabeled data in the Error-Correcting Output Coding (ECOC) setup by decomposing multiclass problems into multiple binary problems and then use Co-Training to learn the individual binary classification problems. We show that our method is especially useful for classification tasks involving a large number of categories where Co-training doesn’t perform very well by itself and when combined with ECOC, outperforms several other algorithms that combine labeled and unlabeled data for text classification in terms of accuracy, precision-recall tradeoff, and efficiency.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Combining Labeled and Unlabeled Data for MultiClass Text Categorization

Supervised learning techniques for text classi cation often require a large number of labeled examples to learn accurately. One way to reduce the amount of labeled data required is to develop algorithms that can learn e ectively from a small number of labeled examples augmented with a large number of unlabeled examples. Current text learning techniques for combining labeled and unlabeled, such ...

متن کامل

Using Error-Correcting Codes with Co-Training for text classification with a large number of categories

A major concern with supervised learning techniques for text classification is that they often require a large number of labeled examples to learn accurately. One way to reduce the amount of labeled data required is to develop algorithms that can learn effectively from a small number of labeled examples augmented with a large number of unlabeled examples. In this paper, we develop a framework t...

متن کامل

کاهش ابعاد داده‌های ابرطیفی به منظور افزایش جدایی‌پذیری کلاس‌ها و حفظ ساختار داده

Hyperspectral imaging with gathering hundreds spectral bands from the surface of the Earth allows us to separate materials with similar spectrum. Hyperspectral images can be used in many applications such as land chemical and physical parameter estimation, classification, target detection, unmixing, and so on. Among these applications, classification is especially interested. A hyperspectral im...

متن کامل

KDD Project Report Using Error-Correcting Codes for Efficient Text Classification with a Large Number of Categories

We investigate the use of Error-Correcting Output Codes (ECOC) for efficient text classification with a large number of categories and propose several extensions which improve the performance of ECOC. ECOC has been shown to perform well for classification tasks, including text classification, but it still remains an under-explored area in ensemble learning algorithms. We explore the use of erro...

متن کامل

Learning to Classify Text from Labeled and Unlabeled Documents

In many important text classification problems, acquiring class labels for training documents is costly, while gathering large quantities of unlabeled data is cheap. This paper shows that the accuracy of text classifiers trained with a small number of labeled documents can be improved by augmenting this small training set with a large pool of unlabeled documents. We present a theoretical argume...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2001